NSF PAR Search | NSF Public Access Repository


One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems

Lu, Ruiming; Lu, Yunchi; Jiang, Yuxuan; Xue, Guangtao; Huang, Peng (April 2025, 22nd USENIX Symposium on Networked Systems Design and Implementation)

Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
Free, publicly-accessible full text available April 28, 2026
Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management

https://doi.org/10.1145/3600006.3613161

Gu, Jiawei T.; Sun, Xudong; Jiang, Yuxuan; Wang, Chen; Vaziri, Mandana; Legunsen, Owolabi; Xu, Tianyin (October 2023, ACM)

Cloud systems are increasingly being managed by operation programs termed operators, which automate tedious, human-based operations. Operators of modern management platforms like Kubernetes, Twine, and ECS implement declarative interfaces based on the state-reconciliation principle. An operation declares a desired system state and the operator automatically reconciles the system to that declared state. Operator correctness is critical, given the impacts on system operations—bugs in operator code put systems in undesired or error states, with severe consequences. However, validating operator correctness is challenging due to the enormous system-state space and complex operation interface. A correct operator must not only satisfy correctness properties of its own code, but it must also maintain managed systems in desired states. Unfortunately, end-to-end testing of operators significantly falls short. We present Acto, the first automatic end-to-end testing technique for cloud system operators. Acto uses a state-centric approach to test an operator together with a managed system. Acto continuously instructs an operator to reconcile a system to different states and checks if the system successfully reaches those desired states. Acto models operations as state transitions and systematically realizes state-transition sequences to exercise supported operations in different scenarios. Acto’s oracles automatically check whether a system’s state is as desired. To date, Acto has helped find 56 serious new bugs (42 were confirmed and 30 have been fixed) in eleven Kubernetes operators with few false alarms.
Full Text Available
Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management

https://doi.org/10.1145/3600006.3613161

Gu, Jiawei Tyler; Sun, Xudong; Zhang, Wentao; Jiang, Yuxuan; Wang, Chen; Vaziri, Mandana; Legunsen, Owolabi; Xu, Tianyin (October 2023, ACM)
$g$-factor engineering with InAsSb alloys toward zero band gap limit

https://doi.org/10.1103/PhysRevB.108.L121201

Jiang, Yuxuan; Ermolaev, Maksim; Moon, Seongphill; Kipshidze, Gela; Belenky, Gregory; Svensson, Stefan; Ozerov, Mykhaylo; Smirnov, Dmitry; Jiang, Zhigang; Suchalkin, Sergey (September 2023, Physical Review B)
Giant g-factors and fully spin-polarized states in metamorphic short-period InAsSb/InSb superlattices

https://doi.org/10.1038/s41467-022-33560-x

Jiang, Yuxuan; Ermolaev, Maksim; Kipshidze, Gela; Moon, Seongphill; Ozerov, Mykhaylo; Smirnov, Dmitry; Jiang, Zhigang; Suchalkin, Sergey (October 2022, Nature Communications)

Realizing a large Landé g-factor of electrons in solid-state materials has long been thought of as a rewarding task as it can trigger abundant immediate applications in spintronics and quantum computing. Here, by using metamorphic InAsSb/InSb superlattices (SLs), we demonstrate an unprecedented high value of g ≈ 104, twice larger than that in bulk InSb, and fully spin-polarized states at low magnetic fields. In addition, we show that the g-factor can be tuned on demand from 20 to 110 via varying the SL period. The key ingredients of such a wide tunability are the wavefunction mixing and overlap between the electron and hole states, which have drawn little attention in prior studies. Our work not only establishes metamorphic InAsSb/InSb as a promising and competitive material platform for future quantum devices but also provides a new route toward g-factor engineering in semiconductor structures.
Reachability types: tracking aliasing and separation in higher-order functional programs

https://doi.org/10.1145/3485516

Bao, Yuyan; Wei, Guannan; Bračevac, Oliver; Jiang, Yuxuan; He, Qiyang; Rompf, Tiark (October 2021, Proceedings of the ACM on Programming Languages)
Ownership type systems, based on the idea of enforcing unique access paths, have been primarily focused on objects and top-level classes. However, existing models do not as readily reflect the finer aspects of nested lexical scopes, capturing, or escaping closures in higher-order functional programming patterns, which are increasingly adopted even in mainstream object-oriented languages. We present a new type system, λ*, which enables expressive ownership-style reasoning across higher-order functions. It tracks sharing and separation through reachability sets, and layers additional mechanisms for selectively enforcing uniqueness on top of it. Based on reachability sets, we extend the type system with an expressive flow-sensitive effect system, which enables flavors of move semantics and ownership transfer. In addition, we present several case studies and extensions, including applications to capabilities for algebraic effects, one-shot continuations, and safe parallelization.
Full Text Available
Graph IRs for Impure Higher-Order Languages: Making Aggressive Optimizations Affordable with Precise Effect Dependencies

https://doi.org/10.1145/3622813

Bračevac, Oliver; Wei, Guannan; Jia, Songlin; Abeysinghe, Supun; Jiang, Yuxuan; Bao, Yuyan; Rompf, Tiark (October 2023, Proceedings of the ACM on Programming Languages)

Graph-based intermediate representations (IRs) are widely used for powerful compiler optimizations, either interprocedurally in pure functional languages, or intraprocedurally in imperative languages. Yet so far, no suitable graph IR exists for aggressive global optimizations in languages with both effects and higher-order functions: aliasing and indirect control transfers make it difficult to maintain sufficiently granular dependency information for optimizations to be effective. To close this long-standing gap, we propose a novel typed graph IR combining a notion of reachability types with an expressive effect system to compute precise and granular effect dependencies at an affordable cost while supporting local reasoning and separate compilation. Our high-level graph IR imposes lexical structure to represent structured control flow and nesting, enabling aggressive and yet inexpensive code motion and other optimizations for impure higher-order programs. We formalize the new graph IR based on a λ-calculus with a reachability type-and-effect system along with a specification of various optimizations. We present performance case studies for tensor loop fusion, CUDA kernel fusion, symbolic execution of LLVM IR, and SQL query compilation in the Scala LMS compiler framework using the new graph IR. We observe significant speedups of up to 21x.
Burstable Instances for Clouds: Performance Modeling, Equilibrium Analysis, and Revenue Maximization

https://doi.org/10.1109/TNET.2020.3015523

Jiang, Yuxuan; Shahrad, Mohammad; Wentzlaff, David; Tsang, Danny H.; Joe-Wong, Carlee (December 2020, IEEE/ACM Transactions on Networking)
Full Text Available
Burstable Instances for Clouds: Performance Modeling, Equilibrium Analysis, and Revenue Maximization

Jiang, Yuxuan; Shahrad, Mohammad; Wentzlaff, David; Tsang, Danny HK; Joe-Wong, Carlee (January 2019, Proceedings - IEEE INFOCOM)

Full Text Available

